Efficient pedigree recording for fast population genetics simulation

Jerome Kelleher, Kevin Thornton, Jaime Ashander, and Peter Ralph (me)

12 March 2018 :: bioRxiv

This talk: slides here

  1. what are tree sequences and what are they good for
  2. explain application to forwards simulation recording
  3. display impressive speedups

The tree sequence

History is a sequence of trees

For a set of sampled chromosomes, at each position along the genome there is a genealogical tree that says how they are related.

Trees along a chromosome
Trees along a chromosome

A tree sequence describes this, er, sequence of trees.

Observations:

  1. The pedigree (parental relationships) plus crossover locations would give us the tree sequence for everyone, ever.

  2. Much less can fully describe the history relevant to a sample of genomes.

  3. This information is equivalent to the Ancestral Recombination Graph (ARG).

Kelleher, Etheridge, and McVean introduced the tree sequence data structure for a fast coalescent simulator, msprime.

  • stores genealogical and variation data very compactly

  • efficient algorithms available:

    • subsetting
    • calculation of allele frequencies in arbitrary cohorts
    • linkage disequilibrium
    • log-time haplotype matching
  • tree-based sequence storage closely related to haplotype-matching compression

Simulated file sizes

file sizes
file sizes
  • HapMap chr1 genetic map (250Mb)
  • Gutenkunst et al out-of-Africa model (3 pops)
  • mutation rate 2 × 10 − 8 per gen
  • at n = 107

    • about 17 million variants
    • VCF size: 318 TiB (250,000× larger)

Example: three samples; two trees; two variant sites

Example tree sequence
Example tree sequence

Storing a tree sequence in the four tables - nodes, edges, sites, and mutations - is succinct (no redundancy).

These are stored efficiently (hdf5) on disk with a bit more information (e.g., metadata).

Nodes and edges

Edges

Who inherits from who; only necessary for coalescent events.

Records: interval (left, right); parent node; child node.

Nodes

The ancestors those happen in.

Records: time ago (of birth); ID (implicit).

Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence
Building a tree sequence

Sites and mutations

Mutations

When state changes along the tree.

Records: site it occured at; node it occurred in; derived state.

Sites

Where mutations fall on the genome.

Records: genomic position; ancestral (root) state; ID (implicit).

Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations
Adding mutations

Forwards simulations

Why forwards simulations?

Coalescent simulations are much faster than forwards-time, individual-based simulations

because they don’t have to keep track of everyone, only the ancestors of your sample.

But: selection, or sufficient geographic structure, break the assumptions of coalescent theory.

Geography or selection break coalescent theory

So, if you

  1. have more than a couple of loci under selection, and/or
  2. have fine enough scale geography that demographic fluctuations are important (e.g., continuous space)

then you have to do forwards-time, individual-based simulations.

To model linked selection, we need chromosome-scale simulations.

Then every individual needs to carry around her genotype (somehow). Even at neutral sites!

Bummer.

But wait…

Forwards-time tree sequence recording

The main idea

If we record the tree sequence that relates everyone to everyone else,

after the simulation is over we can put neutral mutations down on the trees.

Since neutral mutations don’t affect demography,

this is equivalent to having kept track of them throughout.

This means recording the entire genetic history of everyone in the population, ever.

It is not clear this is a good idea.

Tree recording strategy

Every time an individual is born, we must:

  1. add each gamete to the Node Table,
  2. add entries to the Edge Table recording which parent each gamete inherited each bit of genome from, and
  3. add any new selected mutations to the Mutation Table and (if necessary) their locations to the Site Table.
Rightarrow
Rightarrow

This produces waaaaay too much data.

We won’t end up needing the entire history of everyone ever,

but we won’t know what we’ll need until later.

How do we get rid of the extra stuff?

Simplification

Question: given a tree sequence containing the history of many individuals, how do we simplify it to only the history of a subset?

Concretely, given an input tree sequence and a subset of its nodes we call the samples, we want a new tree sequence for which:

  1. All marginal trees match the corresponding subtree in the input tree sequence.

  2. Every non-sample node in marginal trees has at least two children.

  3. All nodes and edges are ancestral to at least one sample.

  4. No adjacent redundant edges (e.g., (ℓ, x, p, c) + (x, r, p, c) → (ℓ, r, p, c)).

Answer: to simplify a tree sequence to the history of the samples:

  1. Paint each sampled chromosome a distinct color.

  2. Moving back up the tree sequence, copy colors of each chromosome to the parental chromosomes they inherited from.

  3. If two colors go in the same spot (coalescence), replace with a new color (unique to that ancestor). Output a node for the ancestor and an edge for the coalescence.)

  4. Once all colors have coalesced in a given segment, stop propagating it.

An example: simplify these to J and K

Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example
Simplify example

Wright-Fisher, N=10: before simplification

Wright-Fisher tree sequence
Wright-Fisher tree sequence

Wright-Fisher, N=10: before simplification

Wright-Fisher tree sequence
Wright-Fisher tree sequence

… and after simplification

Simplified Wright-Fisher tree sequence
Simplified Wright-Fisher tree sequence

Revised tree recording strategy

Every time an individual is born, we must:

  1. add each gamete to the Node Table,
  2. add entries to the Edge Table recording which parent each gamete inherited each bit of genome from
  3. add any new mutations to the Mutation Table and (if necessary) their locations to the Site Table.

… and,

  1. Every so often, simplify the tables so far, retaining the history of the current generation.

Implementation and results

Benchmark implementation

  • Recording, simplifying, and output of tables: C code in msprime.

  • Simulation: fwdpp, by Kevin Thornton (in C++) (code)

  • Glue: pybind11 and numpy

  • Machine: Ubuntu / 2x 2.6 GHz Intel E5-2650 CPU

Other implementations:

Simulation parameters

  1. Wright-Fisher population of size N
  2. simulated for 10N genreations
  3. neutral mutation rate μ equal to recombination rate r per gamete
  4. many, weakly deleterious mutations: rate μ/100 with s exponentially distributed with mean 2.5/N.

Note: if we recorded tree sequences (“pedigree recording”) then the neutral mutation rate was zero but neutral mutations were added afterwards.

Total run time per single simulation as a function of region length.
Total run time per single simulation as a function of region length.
Relative speedup of simulations
Relative speedup of simulations

Memory use

RAM requirements are determined by how often you simplify.

Moving forward

tskit : a toolkit for tree sequences

Tools in msprime can do these things “very fast”:

  1. Read in: tables tree sequence
  2. Write out: tree sequence tables
  3. Iterate over trees,
  4. while computing some statistic (AFS, π, f4, LD, ).
  5. Simplify (i.e., subset).

Upcoming: will be moved to tskit.

tsinfer :: real data in tree sequences

In progress: tsinfer (Kelleher, Wong, and McVean) infers tree sequences from real genomic data.

Watch Jerome’s talk: Simulating, storing & processing genetic variation data w/millions of samples

Tree sequences …

  1. are compact, useful ways to store population history including genome sequence.

  2. can be succinctly encoded in a set of tables, which we provide tools for using.

  3. can be output during a forwards-time simulation,

  4. which not only gets you trees in the end, but also makes much larger simulations possible.

Future uses

  1. Machine learning needs good, fast simulations to train on.

  2. Tree sequences allow quick computation of many genomic statistics from real data.

Current work: tree sequence recording in SLiM (with Jared Galloway).

Thanks

Acknowledgements

Jerome Kelleher, Jaime Ashander, and Kevin Thornton.

Funding: NSF (PR); Wellcome Trust (JK); NIH (KRT); USF&WS (JDA).

Slides with reveal.js and pandoc.